feat(bb): WebGPU field-mul bench + Karatsuba/sos3uv3 Mont mults #23341
Open
zac-williamson wants to merge 4 commits into
…ults

Adds a standalone WebGPU micro-benchmark page comparing three BN254 Montgomery product implementations for chained-mul throughput:

- cios (u32): mitschabaude runtime-loop CIOS over 20×13-bit limbs. Baseline, ~109 ms at n=2^20, k=100.
- karat (u32): recursive Karatsuba + Yuval reduction. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels; the reduction uses precomputed r_inv = W^-1 mod p with zero drains in the multiply phase (the unsigned wrap unwinds via a subsequent subtraction). ~80 ms (~28% faster than cios).
- sos3uv3 (f32, reference): 22-bit f32 limbs with separate per-slot tlo/thi accumulators that break the inner-j carry chain. Single drain per outer iteration via bias_split_f32_le4w. ~79 ms.

The bench harness:

- bench-field-mul.html is a standalone page; it reads ?path=u32|f32&n=N&k=K&validate-n=N&reps=R&variant=V from the URL.
- bench-field-mul.ts runs k chained Mont mults per thread, validates the first `validate-n` outputs against a host BigInt reference, and writes timing into window.__bench.
- scripts/bench-field-mul.mjs is a Playwright driver for headless invocation from the CLI (playwright-core added as a devDependency).
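The host BigInt reference used for validation can be as simple as computing the Montgomery product directly over BigInt. A minimal sketch (names `P`, `montMulRef`, and the helpers are illustrative, not the actual bench code; it assumes the BN254 base-field modulus and the 2^260 radix implied by 20×13-bit limbs):

```typescript
// BN254 base-field modulus (assumed; the bench targets BN254 Fq).
const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;
const R = 1n << 260n; // Montgomery radix for 20 × 13-bit limbs

// Extended Euclid, used to obtain R^-1 mod P.
function egcd(a: bigint, b: bigint): [bigint, bigint, bigint] {
  if (b === 0n) return [a, 1n, 0n];
  const [g, x, y] = egcd(b, a % b);
  return [g, y, x - (a / b) * y];
}

function modInv(a: bigint, m: bigint): bigint {
  const [g, x] = egcd(((a % m) + m) % m, m);
  if (g !== 1n) throw new Error('not invertible');
  return ((x % m) + m) % m;
}

const R_INV = modInv(R % P, P);

// Montgomery product: mont(x, y) = x * y * R^-1 mod P.
// For inputs in Montgomery form (aR, bR) this yields abR, so chained
// GPU results can be checked directly against this reference.
function montMulRef(x: bigint, y: bigint): bigint {
  return (x * y * R_INV) % P;
}
```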
Routes the `montgomery_product_funcs` mustache partial through a
pre-rendered Karatsuba+Yuval body in every MSM shader that does a
base-field multiply (15 callsites: convert_points, smvp, horner,
batch_affine_{apply,schedule,finalize_*,init,apply_scatter},
batch_inverse{,_parallel}, bpr, decompress_g1, montgomery_parity).
The Karatsuba body benches ~27% faster than the mitschabaude
runtime-loop CIOS at n=2^20, k=100 (80 ms vs 109 ms). It exposes the
same `fn montgomery_product(x, y) -> BigInt` symbol plus the same
`get_p` / `conditional_reduce` helpers and uses the same 20×13-bit
limb layout, so the swap is a drop-in change with no callsite churn.
The field-mul bench retains both options (`?variant=cios` renders the
original template inline, `?variant=karat` reuses the class-level
default) so the two bodies can be compared side-by-side.
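The shared 20×13-bit limb layout is what makes the swap drop-in. A hypothetical host-side pack/unpack pair illustrating that layout (the actual packing lives in the WGSL templates; `toLimbs`/`fromLimbs` are names invented here):

```typescript
const NUM_LIMBS = 20;
const LIMB_BITS = 13n;
const LIMB_MASK = (1n << LIMB_BITS) - 1n; // 0x1fff

// Little-endian split of a <=260-bit field element into 13-bit u32 limbs.
function toLimbs(x: bigint): number[] {
  const limbs: number[] = [];
  for (let i = 0; i < NUM_LIMBS; i++) {
    limbs.push(Number((x >> (LIMB_BITS * BigInt(i))) & LIMB_MASK));
  }
  return limbs;
}

// Reassemble the BigInt from its limbs.
function fromLimbs(limbs: number[]): bigint {
  let x = 0n;
  for (let i = limbs.length - 1; i >= 0; i--) {
    x = (x << LIMB_BITS) | BigInt(limbs[i]);
  }
  return x;
}
```

Because both Montgomery bodies consume and produce this exact layout, swapping the mustache partial changes no callsite.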
Phase 1 LANDED — Bernstein–Yang (BY) safegcd inversion (fr_inv_by_a; Option A: 20×13-bit limbs, BATCH=26, carry-free apply_matrix):
- Production swap-in: wgsl/cuzk/batch_inverse{,_parallel}.template.wgsl call fr_inv_by_a
- 1.5× faster than the legacy fr_inv (Pornin, K=12) on the chained-inverse bench
- ~8% MSM wall reduction at logN=16 sanity check
- TS port (cuzk/bernstein_yang.ts, bernstein_yang_a.ts) + Jest tests (24 passing)
- WGSL impls: wgsl/field/by_inverse{,_a}.template.wgsl + wgsl/bigint/bigint_by.template.wgsl
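The core iteration behind safegcd inversion is the Bernstein–Yang divstep. A minimal host-side sketch, shown here computing only a gcd (the TS/WGSL ports listed above additionally track transition matrices to recover the inverse; `divstep`/`divstepGcd` are illustrative names):

```typescript
// One Bernstein–Yang divstep. Precondition: f is odd.
function divstep(delta: bigint, f: bigint, g: bigint): [bigint, bigint, bigint] {
  if (delta > 0n && (g & 1n) === 1n) {
    // Swap roles and subtract: g - f is even, so the halving is exact.
    return [1n - delta, g, (g - f) >> 1n];
  }
  // If g is even the added term vanishes; if g is odd, g + f is even.
  return [1n + delta, f, (g + (g & 1n) * f) >> 1n];
}

// Iterate divsteps until g reaches 0; |f| is then gcd(f0, g0).
function divstepGcd(f: bigint, g: bigint): bigint {
  let delta = 1n;
  while (g !== 0n) {
    [delta, f, g] = divstep(delta, f, g);
  }
  return f < 0n ? -f : f;
}
```

Production code runs a fixed number of divsteps in batches (BATCH=26 above), accumulating each batch into a small matrix applied to the limb vectors.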
Phase 2 EXPLORATORY — multi-window pooled batch_inverse + multi-window BPR:
- WPB plumbing in batch_inverse_parallel + dispatch_args + batch_affine.ts
- Default WPB=1 (= legacy behavior, no perf change)
- BPR_WINDOWS_PER_BATCH knob in bpr_bn254.template.wgsl
- Empirical: pooling without growing WG count gives 0% gain — design needs restructure
Standalone bench infrastructure:
- bench-divsteps, bench-apply-matrix, bench-fr-inv, bench-batch-affine
- Each with HTML page + TS dispatcher + Playwright runner under dev/msm-webgpu/scripts/
- profile-sanity.mjs for per-pass GPU time breakdown on the Quick Sanity Check
Tree-reduce design (Stage B) for autonomous remote execution:
- .claude/plans/msm-tree-reduce.md — full design (adaptive batch sizing, analytical slice partition, 2 distinct phase kernels)
- .claude/plans/remote-agent-brief.md — remote agent execution brief
Co-authored with Claude.
Summary
Adds a standalone WebGPU micro-benchmark page (bench-field-mul.html + headless Playwright driver) that compares three BN254 Montgomery product implementations for chained-mul throughput.

karat (u32, the main win)

Recursive Karatsuba (20×20 → 10×10 → 5×5) over unsigned 13-bit limbs, with Yuval reduction using precomputed r_inv = W⁻¹ mod p. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels. Zero drains in the multiply phase: a single pp_cr_C slot overflows u32 by ~1.25×, and the wrap unwinds correctly through the subsequent unsigned subtraction (the algebraic identity P_mid[m] = Σ(x_lo·y_hi + x_hi·y_lo) is non-negative per limb at lazy values). Fully unrolled via mustache so all indices are compile-time constants — naga SROAs the temp slots into registers instead of thread-private memory.

sos3uv3 (f32, kept as reference)

22-bit f32 limbs with separate per-slot tlo[k]/thi[k] accumulators that break the inner-j carry chain. Each j writes unique tlo[j-1] and thi[j], so there is no overlap or RAW dependency across iterations. Single drain at the end of each outer iteration via bias_split_f32_le4w. The 22-bit width buys an exact 4-way sum (4·W = 2²⁴ fits in the f32 mantissa).

Test plan

1. yarn install to pick up the playwright-core devDependency.
2. yarn generate:wgsl && yarn build:esm.
3. cd barretenberg/ts && ./node_modules/.bin/vite --config dev/msm-webgpu/vite.config.ts --no-open.
4. node barretenberg/ts/dev/msm-webgpu/scripts/bench-field-mul.mjs --path u32 --n 1048576 --k 100 --variant karat --validate-n 1024 --reps 6 (and --variant cios, and --path f32 --variant sos3uv3). Each should print VALIDATION OK and a timing reps=… median=… line.
5. Alternatively, open http://localhost:5173/dev/msm-webgpu/bench-field-mul.html?path=u32&variant=karat&n=1048576&k=100&validate-n=1024&reps=6 and read window.__bench.
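The exact 4-way-sum property sos3uv3 relies on can be checked on the host: four values each below 2²² sum to less than 2²⁴, so every partial sum fits in the 24-bit f32 significand. A sketch (assumed setup, not bench code) using Math.fround to round each step to f32 precision, emulating GPU f32 accumulation:

```typescript
// Chain four additions, rounding each partial sum to f32.
// All partials stay below 2^24, so each rounding is a no-op.
function f32Sum4(a: number, b: number, c: number, d: number): number {
  return Math.fround(Math.fround(Math.fround(a + b) + c) + d);
}
```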